Using Collocations and K-means Clustering to Improve the N-pos Model for Japanese IME

Authors

  • Long Chen
  • Xianchao Wu
  • Jingzhou He
Abstract

Kana-Kanji conversion is one of the representative applications of Natural Language Processing (NLP) for the Japanese language. The N-pos model, which represents the probability of a Kanji candidate sequence as the product of bi-gram Part-of-Speech (POS) probabilities and POS-to-word emission probabilities, has been successfully applied in a number of well-known Japanese Input Method Editor (IME) systems. However, since the N-pos model is an approximation of an n-gram word-based language model, important word-to-word collocation information is lost during this compression, which leads to a drop in conversion accuracy. To overcome this problem, we propose two ways to improve the current N-pos model. One is to append high-frequency collocations; the other is to sub-categorize the huge POS sets to make them more representative. Experiments on large-scale data verified our proposals.
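The scoring described in the abstract can be illustrated with a minimal Python sketch. It assumes a candidate is a list of (word, POS) pairs and that the score is the product of POS bi-gram probabilities and POS-to-word emission probabilities, computed in log space. The tables `POS_BIGRAM` and `EMISSION`, the tag names, the `<s>` start marker, and the function `npos_log_prob` are hypothetical and use toy values; none of them come from the paper.

```python
import math

# Hypothetical toy parameters for illustration only (not from the paper).
# POS bi-gram probabilities P(pos_i | pos_{i-1}); "<s>" marks the sequence start.
POS_BIGRAM = {
    ("<s>", "NOUN"): 0.5,
    ("NOUN", "PARTICLE"): 0.6,
    ("PARTICLE", "VERB"): 0.4,
}

# POS-to-word emission probabilities P(word | pos).
EMISSION = {
    ("NOUN", "私"): 0.01,
    ("PARTICLE", "は"): 0.2,
    ("VERB", "行く"): 0.005,
}


def npos_log_prob(candidate):
    """Log-probability of a candidate given as a list of (word, pos) pairs.

    Sums log P(pos_i | pos_{i-1}) + log P(word_i | pos_i) over the sequence.
    Returns -inf if any required probability is missing from the toy tables.
    """
    log_p = 0.0
    prev_pos = "<s>"
    for word, pos in candidate:
        trans = POS_BIGRAM.get((prev_pos, pos), 0.0)
        emit = EMISSION.get((pos, word), 0.0)
        if trans == 0.0 or emit == 0.0:
            return float("-inf")
        log_p += math.log(trans) + math.log(emit)
        prev_pos = pos
    return log_p


candidate = [("私", "NOUN"), ("は", "PARTICLE"), ("行く", "VERB")]
score = npos_log_prob(candidate)
```

Because the score depends on words only through their POS tags, two candidates with the same tag sequence are separated only by emission probabilities. This is the loss of word-to-word collocation information that the abstract refers to.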

Similar resources

Customer behavior mining based on RFM model to improve the customer relationship management

Company managers are eager to extract hidden and valuable knowledge from their organizational data. Data mining is a well-known technique that can be applied to customer data to discover hidden knowledge and information in customers' behaviors. Organizations use data mining to improve their customer relationship management processes. In this paper R, F, an...

Using the Web to Train a Mobile Device Oriented Japanese Input Method Editor

This paper describes the construction of a Japanese Input Method Editor (IME) system for mobile devices, using large-scale Web pages. We provide the training process of our IME model, an n-pos model for local Kana-Kanji conversion and an n-gram model for online cloud service. In particular, we propose an online algorithm for mining new compound words, together with the detailed post-filtering process t...

An Improved K-Means with Artificial Bee Colony Algorithm for Clustering Crimes

Crime detection is one of the major issues in the field of criminology. In fact, criminology includes knowing the details of a crime and its intangible relations with the offender. In spite of the enormous amount of data on offenses and offenders, and the complex and intangible semantic relationships between this information, criminology has become one of the most important areas in the field o...

A Hybrid Data Clustering Algorithm Using Modified Krill Herd Algorithm and K-MEANS

Data clustering is the process of partitioning a set of data objects into meaningful clusters or groups. Because clustering algorithms are widely used in many fields, much research is still under way to find the best and most efficient clustering algorithm. K-means is simple and easy to implement, but it is sensitive to the initialization of cluster centers and hence can become trapped in a local optimum. In this paper...

An Efficient Predictive Model for Probability of Genetic Diseases Transmission Using a Combined Model

In this article, a new combined approach of a decision tree and clustering is presented to predict the transmission of genetic diseases. In this article, the performance of these algorithms is compared for more accurate prediction of disease transmission under the same condition and based on a series of measures like the positive predictive value, negative predictive value, accuracy, sensitivit...

Publication date: 2013